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Abstract 

Nowadays, with the widespread of smartphones and other portable gadgets equipped with a variety 
of sensors, data is ubiquitous available and the focus of machine learning has shifted from being able 
to infer from small training samples to dealing with large scale high-dimensional data. In domains 
such as personal healthcare applications, which motivates this survey, distributed machine learning 
is a promising line of research, both for scaling up learning algorithms, but mostly for dealing with 
data which is inherently produced at different locations. This report offers a thorough overview of 
and state-of-the-art algorithms for distributed machine learning, for both supervised and unsuper¬ 
vised learning, ranging from simple linear logistic regression to graphical models and clustering. 
We propose future directions for most categories, specific to the potential personal healthcare appli¬ 
cations. With this in mind, the report focuses on how security and low communication overhead can 
be assured in the specific case of a strictly client-server architectural model. As particular directions 
we provides an exhaustive presentation of an empirical clustering algorithm, k-windows, and pro¬ 
posed an asynchronous distributed machine learning algorithm that would scale well and also would 
be computationally cheap and easy to implement. 


1 Introduction 

Machine learning offers a wide variety of algorithms for mining and prediction and traditionally the bottleneck was the 
limited amount of data available, with larger datasets being necessary for high accuracy E). However, nowadays the 
limiting factor is the problem of learning algorithms computational time not scaling well with the data size, making 
learning from large datasets in a reasonable time impossible. In this context, distributed learning is a promising line of 
research since allocating the learning process among several computing nodes is a natural way of scaling up learning 
algorithms, allowing us to deal with both memory and time constrains. Moreover, it allows to deal with data sets 
that were naturally produced from many physically distributed locations and/or where transmitting large amounts of 
data to a central processing unit is expensive or impractical, situations which is becoming more frequent in many real 
applications ll50l . In some cases the data might contain sensitive information, such as medical or financial records, or 
data that by nature can not be exchanged with the outside of a company, but still companies would like to exchange 
knowledge or cooperate, for example, in preventing fraudulent intrusion ||33l. At a high level, distributed learning 
is the cooperation of experts, each of which has a different perspective based on own observed data, and in which 
by message passing the agents communicate to improve the quality of the learned knowledge. Working with mobile 
agents adds to the challenges of coordination and privacy another stringent factor, low communication overhead. 

The potential of machine learning to healthcare application has been recognized from some time OllllTl and also the 
benefit of multi-agent systems ll44l . but in recent years the diversity of sources and the amount of available information, 
more than 420 million radiological images are generated in US alone ||2l, have created the need to adopt distributed 
machine learning techniques 1134115^ . Furthermore, with the rise of health-tracking apps for mobile phones, user 
activity and sensor data are becoming widely available, but the data is mostly used to generate simple statistics and 
visual attractive activity history plots. Even if efforts are made to incorporate machine learning methods that use 
behavioral data to generate personalized feedback El, applications that work by combining users information to 
increase accuracy of machine learning methods and that can infer and suggest actions beyond user input, such as in 
collaborative filtering, are just now emerging, e.g. ||32l. This subject is just beginning to attract more attention with 
the creation of Pittsburgh Health Data Alliance 0 and MRC Health eResearch Centre 111 and many more focused 
research centers m- Potential to extract useful public-health data goes beyond electronic health records (EHR), with 
results emerging from web search queries Il24l and even social media ifTSl . With the emer genc e of cloud applications 
for health data 0, privacy of sensitive health data is becoming more of a requirement ITtI l68ll . In this paper we 
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will revisit some of the key advances in distributed machine learning in the hope that they will prove valuable for the 
research performed in personal healthcare applications. 

In the rest of this report we will provide the general problem we are trying to solve in Section 2 and we will present 
the distributed machine learning algorithms classihed based on if a label for the data points is provided or not in 
either supervised or unsupervised in Section 3, respectively in Section 4. For supervised learning we will look at the 
parametric methods of linear regression and Support Vector Machines (SVM), non parametric methods of Gaussian 
Processes (GP) and probabilistic graphical models (PGM). For unsupervised learning we will focus just on basic 
clustering methods. Before concluding we present in Section 5 a general implementation for large scale data and our 
approach to an asynchronous distributed machine learning algorithm. 


2 Preliminaries 


In this section we will introduce the basic model we will be focusing in the rest of the report. In classical machine 
learning we are provided with a dataset V consisting of N points G K" and in the case of supervised learning 
we also receive a real value or class label for each point (or we will need to infer it in the case of unsupervised 
learning). The goal is to find a set of parameters 6* that will minimize a loss function /(x, yjff) over the entire dataset 

0* = argmin^^j^ /(x^, yi|6*). In distributed machine learning we will consider we have K nodes (or agents), each 
e 

(k) (k) 

provided with a dataset Vk consisting of Nk points x) G K" and respective labels if provided and the goal 

is for each agent to learn the same parameter 0^'^^ * = argminX^^i There might be cases 

e 

were we would want instead of a global optimizer, to have each node learn local optimizers specihc to the dataset 
they have or even a mixture of local and global optimizers, but this will not be the focus of this article. In the 
case were ^ = 'D, the distributed algorithm ideally should converge to the same set of 

parameters, i.e. = 0*. For evaluation of the performance of the algorithms, in the case the last condition is 

satished, we have the same performance of the non-distributed learning methods. If this is not the case then we need 
to dehne a measure of how far the distributed algorithms result diverges from the true one, considered here the non- 
distributed algorithm. Because we can not have the pie and eat it to, we will incur some overhead from performing the 
computations distributed, in which case we need to compare performance in complexity, computation time, memory 
or communication extra costs. Privacy will be also a concern if transmitting (x^, is forbidden. 



(a) Sensor network (b) Hierarhical network (c) Central server network 


Figure 1; Possible types of network topologies, where the green circles denote data nodes and the yellow arrows show 
example of information network flow 


From the point of communication between nodes, the network architecture can be categorized in three scenarios, 
depicted in Fig.f^ In the case of sensor networks ED a node is allowed to communicate only with its immediate logical 
or physical neighbors, whereas in the of hierarchical networks the connectivity graph resembles a tree structure and 
nodes are allowed to communicate only with its immediate ancestor and children. Usually in this case the leaves are the 
data nodes where the distributed dataset Vk resides and upper level nodes are used for aggregation and collaboration. 
The central server network is a special case of hierarchical architecture in which we have a two level deep tree. The 
focus of this report is to look at distributed machine learni ng algorithms in the scenario |l(c)| although a hardware 
settin g like in |l(c)| can have a logical structure like in |l(b)| It is not straightforward how an algorithm designed for 
|l(a)| can be adapted to |l(c)| but a possible starting point is to create a common neighbor designated as the central 
processing unit. 
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3 Supervised learning 

3.1 Linear regression 

For linear regression we have ?/i G K and /(xj, ?/i|0) = 0.5||yi — ax, — 6|j|, where 9 is a G M” and 6 S M. 

El and a provide parallel solutions to logistic regression (with regularization) that hinge upon an Allreduce function 
(from MPl architecture). This function can be simulated by a two step communication with a central server, first each 
node sends to the server the current local estimate and then al of the nodes receive back from the server the 
optimal global parameter 0 as a function of this local estimates. Using a regularization function g{9) (if g{9) = ||0||i, 
the problem is known as LASSO) and introducing the variables Zi = axi + b, we end up with a constrained convex 
optimization problem of the form 9*, z* = argmin^^j^ f{zi,yi) + 9i^j) that Zi = ax.; + & for 1 < 

i < A^. In the distributed case we introduce additional constrains of the form 9^^'> = 9 and z^ = 
account for nodes having access only to local copies of the variables. Application of the DouglasRachford splitting 
(also known as ADMM) to this optimization problem leads to a three stage algorithm with several proximity functions 
carried in parallel at each node and two Allreduce functions, one for coordination on 9 and one for coordination on 
z, repeated until convergence. Q proposes use of a quasi-Newton method which reduces the number of iterations, 
but increases communication cost per iteration. L-BFGS is implemented distributed using just one Allreduce function 
to compute the optimal step and to synchronize state, such that each node maintains the rank-1 updates and L-BFGS 
is only run locally. Furthermore,local stochastic gradient evaluations are used. The authors in ll4^ use a similar 
approach of operator splitting applied to a Coordinate Descent algorithm. They argue that a greedy selection of update 
coordinates is better suited for sparse solutions instead of cycling in Gauss-Seidel fashion or random selection and 
imply it outperforms the parallel version of ADMM |jT] . 

Privacy of user data is assured in @ by sending the empirical second order statistics. If the user data matrix, where each 
row is a data point or outcome, are X and Y we can find the global optimizer by computing 9 = X)~^X'^Y = 

W-^V, where W = and V = k = with X^'^'> and Y^^l = 

where X^^'> and Y^^'> denote each node partition of data. This work is motivated among others by a health 
application, but privacy can not be assured when user data is highly correlated or when privacy of some user data 
distribution is needed and can be used only for masking exact data points values. 

In Section 5, we will detail the method in llJTl which uses as example -regularized logistic regression. 

3.2 Support Vector Machines 

We will look at Support Vector Machines (SVM) for classification, where yi G {—1,1} is defined as the class label and 
/(xi, yil9) = [1 — 0]+, with [u] = max(0, v), (l)i = 4>(xi) G for a given basis function (/>(■), and 9 G We 

have also a regularization function given by g(9) = O.5||0||2 and regularization parameter A, but different penalization 
functions can be applied leading to other maximal margin classifiers, e.g. g(9) = ||0||i for Sparse-SVM. SVM is very 
popular since the dual of the convex optimization problem needed to learn 9, is a quadratic (least-squares) constrained 
problem, i.e. max — a(y^<l><l>^F)a^, that can be solved by a greedy like algorithm (Sequential Minimal 

Optimization), it is naturally kernelized by K = (only positive definite kernels work), and the solution a is 

sparse. Although 9 can be recovered easily, the decision is taken only considering examples that fall at the boundary, 

called Support Vectors (SVs), < 0(2^); 4>i >= a is sv ^ 4’i >> which makes the SVM method 

extremely fast for classification of a new example. 

A first method that only exchanges the SVs is described in as cascade SVM, where SVs from each node are send 
to the central node and another round of SVM training is performed on the received SVs. The new SVs are send back 
to worker nodes where they are added to the training set and the procedure is repeated recursively until the SVs from 
one round to the other do not change. A generalization of the cascade SVM was introduced in HOt . where in a directed 
graph network topology a node combines the SVs received from predecessors with local SVs and pushes the result 
to its successors. Use of this algorithms using the MapReduce paradigm is presented in Q and we are also offered a 
good introduction to the Hadoop architecture. 

The Alternating Direction Method of Multipliers (ADMM) has been used in various scenarios where distributed algo¬ 
rithms are posed as a consensus problem and for SVM we can offer as examples El, where the consensus problem 
is formalized on the primal and allows for an easy extension to an online algorithm, and more recently ifTTI . where the 
consensus problem is extended also to using a — £2 regularizer. While training is performed independent on each 
node by solving the dual problem for of the corresponding local examples, in ifTOl average consensus between 
neighbors is used for classification of a new example without partial training classifiers being combined and it is rerun 
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for each presented example. In ll72]| a modified dual problem for sparse SVM learning based on removing the inequal¬ 
ity constrain and adding an indicator function to the objective is solved again with the use of ADMM. However it is 
not transformed in a consensus problem and the distributed algorithm simply hinges upon the parallelization of the 
proximal operator for separable objective functions naturally arising in machine learning problems. 

The Extreme Support Vector Machine (ESVM) classifier ll39l can be trained much faster than other nonlinear SVM al¬ 
gorithms since it requires the inversion of small matrices and in parallel and incremental algorithms are proposed, 
but they require shuffeling of the entire set of points, leaving the problem of distributed computation open. 

Not encountered in the literature review, a consensus problem may be formulated for the dual SVM problem as well, 

in which a node would set some of its local SVs to zero, i.e. a] =0, to satisfy consensus and minimize the objective 
function along with the overall loss. In this case, however, the Lagrange multipliers parameters choice may play an 
important role since consensus would be reached by ignoring nodes with more examples and that would eventually 
contribute to the margin error in a greater proportion than nodes with less training examples in the margin. At least 
weights proportional to the number of training examples of each node should be used, although other factors would 
contribute, such as the actual distribution of examples (a node that contains only non essential SVs should ideally have 
a weight of zero). 


3.3 Gaussian Processes 


In the following two sections we will not keep our notation and minimization of loss function /(•) framework. 
In Gaussian Process(GP) regression we define a Gaussian process prior over latent function f as p(f|A') = 
JV{n{X),K{X,X)) with K{X, X)ij — k{xi,Xj) for some kernel function k and a likelihood over the data val¬ 
ues y as p(y|f) = JV{{,aI]s[) which corresponds to measurements of the form y = f -f e with e ~ A/'(0, ct/at). 
Prediction /* at a new point x* is obtained using the fact that /* and y are jointly Gaussian 




K{X,X) + alN 
Ki^\X) 


K{X,^*) ]\ 
iT(x*,x*) \) 


The marginal p(/*|y) is a Gaussian with mean p* and variance a* given by 

ii* = + iT(x*, X) [K{X, X) + alN]~^ (y - li{X)) 

a* = K{-x*, X*) - K{x\ ) [K(X, X) + K(X, x*) 


GP is a non-parametric method, however we have a set of hyperparameters 9 that define the process, e.g. the kernel 
might depend on some scale variable. To estimate the hyperparametes we maximize the marginal log-likelihood 

p(y|X, 9) = -0.5(y - ti{X)f [K{X, X) + (y - fi{X))^ - 0.5 log det(iT(X, X) + al^) ED 

A practical limitation of Gaussian Processes(GP) is that training and predicting have 0{N^), respectively 0{N‘^) 
running times.In sparse GP, the true posterior GP can be approximated by averaging over a small set q of auxiliary 
inducing variables f^ drawn from the same GP prior as the training function values evaluated at pseudo-inputs Xq 
independent from the training inputs and that ideally should be sufficient statistics for f (y* and f are independent given 
iq) and thus /i* and a* are now computed using Xq, instead of X. 1^ selects the inducing points by minimizing the 
KL divergence between actual and computed joint distribution of f, f^. Recently ll23l used the sparse approximation 
GP introduced in ll66ll to carry the computation in an embarrassing parallel model on each node and to coordinate uses 
a central node to accumulate partial results and redistribute back the aggregated value to the network of nodes. 

Computations can also be distributed by using independent local models, where predictions are made by weighting 
the predictions of local experts. One approach called Mixture-of-experts (MoE) assumes there are P independent 
experts f^^'^ ^ A/'(0, k^P\Xp, Xp), each with a set of hyperparameters and an auxiliary latent variable z„ associated 
which each datapoint that identifies the expert the noisy corrupted measurement y„ = (x„) + e„. comes from. We 

can have hard assignments where one datapoint comes form a single expert or soft assignments where z„ indicates 
the proportion of one expert in the mixture of each datapoint. MoE has the drawback it requires Markov Chain 
Monte Carlo (MCMC) or variational approximations to assign data points to GP experts and, in order to bypass this 
problem, ll46l proposed a method in which each expert is augmented with a set of inducing points and assignment is 
done by proximity to experts with the use of a MAP estimator. The estimator boils down to = argmin (x„ — 

pe[i....,P] 

m.p)'^V~^{xn — nip), where nip and V (diagonal) are quantities based on each expert set of inducing points. 

The Product-of-GP (PoE) models multiplies independent GP experts contribution (naturally weighting each con¬ 
tribution), but tend to be overconfident ED. The marginal likelihood in PoE can be factorized according to 
the local datasets (but not necessarily) as p{y\X,9) « H^i which transforms the objective 

function used for training (log-likelihood maximization) in K separable. After training the prediction can be 
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computed as p{f*\x* ,Y) = with the solution {a*) ^ ^ p* = 

p^^\'x*). The generalized PoE-GP model lfT3l models the importance of GP-experts by 
weighting contribution as p(/*|x*,y) = ^iP^''(/*|x*,y(^)), which reflects in the solution as (a*) ^ = 

J2k ^ and/r* = {cr*f J2k Pk (cr^^H^*)) {x*). For /3fe = 1, the model falls backs 

to the prior outside the range of the data, but the drawback is that in the range of the data it overestimates and 
with an increasing number of experts it becomes to conservative. However in a central server model coordi¬ 
nation to ensure /3fc = 1 is easy to accomplish. The Bayesian Committee Machine(BCM) ll67l explicty 

incorporates the GP prior into predictions as p(/*|x*,y) = JIaLi ’ which reflects in the solution 

as (cr*)”^ = (tT(^'i(x*)) ^ -I- (1 — M)aQ^ where (Tq is the prior variance of p{f*). The advantages of 

this model is again that the correction term ensures a consistent model that falls back to the prior, but it may 
become inconsistent if a GP expert relies on a small number of datapoints. In ifTTll proposed the generalized 
Bayesian Committee Machine (gBCM), which combines the generalized PoE-GP model use of additional weights 
to model importance of experts with the BCM model which incorporates the GP prior when combining predic¬ 
tions. The marginal likelihood becomes p(/*|x*,y) = IIa—i which reflects in the solution as 

= Ilk X*)) ^ -I- (1 — /3fe)(Tg ^ where erg is the prior variance of p{f*) .The advantage of the 

gBCM is that it avoids the drawbacks of the BCM which suffers from weak experts and that of the PoE-GP that over¬ 
estimates the variance and also can be layered in an arbitrary depth tree structure with each leaf representing a GP 
expert. 

Rather than forcing the nodes to collaborate in building a common GP model (requires central node of coordination in 
above mentioned sources),one can imagine an approach in which each node builds a local MoE model using own data 
(one GP expert) and information shared by neighbors/rest of network (other experts). 

3.4 Probabilistic graphical models 

The most common types of probabilistic graphical models are Bayesian Networks (BNs) and Markov Random Fields 
(MRFs). The BNs can be visualized as a directed acyclic graph where the nodes are given by random variables and 
edges correspond to influences of one Random Variable (RV) over another and each RV is associated with a conditional 
probability distribution given its parents. MRFs can be visualized in the same way only as a undirected graph. As in 
the case of GPs, graphical models provide an approach to deal with uncertainty and have become an extremely popular 
tool for modeling uncertainty. 

The computation of the gradient for the Maximum Likelihood Estimator (MLE) in undirected probabilistic graphical 
models is difficult since while the data term is easily computed, the model term requires evaluating a sum with ex¬ 
ponentially many terms over the model distribution. MCMC is a common approach 0, but it is hard to implement 
in a distributed fashion. Replacing the MLE with a composite estimator, the Maximum Pseudo-Likelihood Estimator 
(MPLE), in which the conditional distribution of a RV depends only on the neighbors in the Markov Graph, then 
the gradient becomes data-dependent only, but the same parameter needs to be shared across multiple factors (not 
distributed friendly computation). In 13^ treat it as a consensus optimization problem and solve it using ADMM in 
a general setting. In ll4^ the composite estimator is created by defining an MRF for a all the variables in a clique q 
as well as all the variables sharing one neighbor and creating thus a sub-problem for each maximal clique,consensus 
being achieved in one step, but requires a restrictive factorization. Further in ll43l . they continue and are able to import 
the general conditions from lIMl and prove the MPLE as a special case. 

4 Unsupervised learning 

4.1 Distributed clustering 

We will now return to the framework of using a loss function /(•), with the particularity that the class labels pi are 
not provided and need to be inferred. The purpose of clustering is to find D clusters, each with a prototype vector 
Pd, and assign the data points x„ to this cluster such that a distance measure between the prototype vectors and the 

datapoints assigned to them is globally minimized. In the case of hard assignments /(x„) = — PdWi, 

and the assignment to cluster d is given by Vnd = 1 for the binary variable The optimization algorithm is known 
as Expectation-Maximization (EM) and consists of iteratively applying two successive stages: in the E-step pd is kept 
constant and the optimization is carried on rnk which corresponds to assigning datapoins to clusters, in the M-step the 
assignments Vnk are fixed and the optimization is carried on the cluster centroids pd- The algorithm is guaranteed to 
converge to a local minimum. After the training is performed a K-d tree can be constructed to minimize the time it takes 
a newly point to be labeled as belonging to a particular cluster. This is a binary tree where the branches are separated 
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by Ci < h and Ci > h, for some i dimension and a value h. The leaves contain information about which Voroni regions 
intersects the subspace defined by each of the leaves. This tree can be constructed using statistical information or by 
using just the training sample data ll55]l . Since ultimately new datapoints could influence the structure of the clusters, 
in 1651 analyzes the Hartigan method (which resembles a submodularity argument) comparing it to the k-means online 
version, but leaves a lot of open questions regarding performance optimization and potential interesting results. 

For the case of soft assignments a more probabilistic approach has to be followed. One approach is to use a Gaus¬ 
sian mixture model (although any other conjugate distributions pairs could be used), where the marginal distribution 
p(x„) = '^d) and a latent variables z„ with the marginal (or prior) p{znd = 1) = is used. The 

cluster soft assignment is given by the conditional probability p{znk = l|xra) known as responsibilities. Training is 

performed by maximizing the log-likelihood lnp(X|7r, p, S) = with the use of 

the EM algorithm by optimizing the responsibilities in the E-step and Gaussian experts mean and variance, and prior 
of the latent variables z„ in the M-step. 

Considering homogenous data distributed across nodes with each node dataset having the same number of clusters 
and clusters across nodes having the same underlying distribution, 1^ construct the convex hulls of the clusters and 
then averages them at a central node. Considering heterogeneous data, summarizes data in a node by a set of 
points, such that every point has a minimum number of neighbors and all neighborhoods do not overlap, referred as 
DBSCAN II 20 I . Such a summarization has low transmission cost, as the number of transmitted representatives to a 
central server for global aggregation is much smaller than the cardinality of the complete local datasets. Using this 
representatives, the global clustering is run at the central server using a similar non-distributed algorithm, adapted 
from DBSCAN. Since the model is the same in each agent when dealing with homogenous data, ADMM can also be 
used in a consensus problem for the k-means centroids Gil. 

In EtI the problem of transmission cost is resolved by dynamically constructing the clusters locally based on a 
parameter T which defines the maximum radius of a cluster and forming new clusters for points outside, then iterating 
in transmitting the centroids with summary statistics to a local server which decides if merging of clusters is necessary 
and sending back aggregated summaries. 

4.2 K-windows 

It was suggested to analyze in depth the K-windows algorithm proposed in j^. The algorithm consists of con¬ 
structing an initial window(actually an orthogonal range space or box) for each cluster and then iterating in the same 
manner as the K-means algorithm, assigning datapoints to clusters in the E-step and centering in the M-step. The 
algorithm is then continued in a second phase where the clusters are enlarged to incorporate as many as possible suit¬ 
able examples. The algorithm is highly empirically constructed and although several papers have been published no 
convergence guarantees or proof of optimality is pres ented. Many of core components or parameter selection are not 
precisely explained as for example, from 16^ to iM] the enlargement procedure has been changed from maintaining 
the cluster centroid unchanged within a predetermined region to the number of newly added examples to be above a 
given threshold. The algorithm also has a third phase added in a later publication in which clusters are combined if 
the number of overlapping examples is greater than a given threshold or if simply if the windows overlap more than a 
relative volume. The authors propose that the algorithm can be started with a large number of clusters and using this 
aggregation procedure it will converge close to an optimal number of clusters, but again no guarantees are provided 
by using this random initialization procedure. The algorithm has been applied to problems from a variety of fields 
Il60l 1(581116211481II 59 I . but our experiments using synthetic data have shown that the algorithm is not very effective in 
high-dimensional spaces, partially admitted by the authors in IMl where they use PCA for dimensionality reduction, 
and also it does not perform well when the clusters are not clearly separable. Another problem observed from running 
the algorithm is that the precision is high (due to the enlargement of windows procedure), but at the same time the 
recall rate is high considering that there is no stated solution for how assignment is handled for points that belong to 
two clusters that are not fit to be merged by the third phase. The benefit of the K-windows algorithm is however it can 
use a datastructure (see 15^ or ||5^ )which allows points belonging to a particular cluster to be easily accessed and 
speeds up the M-step considerably, although similar algorithms can be used for k-means (and in particular for k-means 
using foo norm) 

Because of the way it is currently formulated, the algorithm is not suited for rigorous analytical tools, but we can 
make the observation that the Voroni regions resemble to Voroni regions constructed by a k-means algorithm using 
foo norm. Eurthermore, the particular way the merging is performed seems to indicate that we are checking if the two 
clusters come from the same uniform distribution (the ’’overlapping” volume would be proportional to the probability 
mass of a uniform RV) and even more would be suited for test statistics which are based on ’’spacing frequencies” 
(i.e., the numbers of observations of one sample which fall in between the spacings made by the other sample which 
in our case coincide with the set of common points) Gl). There are a variety of statistical tests which can be used to 
accept or reject a hypothesis of two samples coming from the same distribution (which is the definition of a cluster), 
but this goes beyond the scope of this report. The only point we want to make is that using as priors exponential 
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families from the Laplace, Gaussian or ’’uniform” distribution the use of € 1,^2 or respectivley^oo distance measures in 
the k-means algorithm, they maximize the ML estimator. This prior are given by (x) = where Ci = 


roo 

•J-ooe 


idxi • • • cix„ = 2”, H 


where Ca = • • • f~ 

’ ^ J —00 J —C 


/_oo J-00 ~ ^ — C2 

andp£^(x) = where Coo = ■ ■ ■ dxn = 2"n!, and even if the 

correspond to an exact uniform distribution, for all practical uses it would be almost indistinguishable. 


dxi ■ ■ ■ dXn = V^" 

does not 


For this reasons we will now look at results for k-means using the £^0 norm. In an early paper on K-means BD, 
MacQueen stated that low valued p-metric spaces (e.g. p = 1) will tend to locate the centroids in areas of large 
probability, while high valued p-metric spaces (e.g. p = 00 ) will tend to have centers such that we have a spherical 
covering of the space with minimal radius. In ifT^ the hard and soft clustering algorithms for l\ and £00 metric spaces 
are studied, with results showing that the cluster centers are influenced by the initial assignment although the final 
results tend to match in most cases. The tests were performed on a small dataset and the behavior observed for l^o 
remained an open question. In ll26l . a slide variation of the metric distance might make the norm more robust 
to outliers (but it is not well defined in our opinion).The authors of Qo) use this distance and proposes a smoothing 
technique for solving the non-smooth problem of min||a; — a||oo 


In the following we will lay the translation of the empirical k-windows algorithm to a more mathematically sound form, 
hopefully to make it easier to analyze in the future. Considering for simplicity that the initial volume a is a square with 
edge length 2r (the rectangle case can be obtained by using a weighted norm, = max{uii|xi|, • • • , Wd\xd\} ), 
the first phase of the algorithm is 


n K 

i=l k=l 

or a K-means algorithm where the E-step is skipped and simply replaced with the cluster assignments Ui^k = 
l|ki-cfc||oo<r '^he M-step remaining the same. 

For the second phase for each cluster k and each coordinate d we consider and to be modified by some 

enlargement procedure and ^k,new represent the weight of each cluster. The algorithms loops, computing 


— min’y^ l||a;._e^|| _ <rll^t Cfe|l2 

Cfc *00 

2=1 

until card(l||^,_c,||^^,_„„<r)/card(l||^_c,||^^^_„„<^) < 6» (or in same versions dist(c^™, < e). 

Similarly the merging of clusters (can start by looking only at clusters for which dist(ci,Cj) < 
2max{||iuV||oo, llw^rlloo}) is done by evaluating the ratio card(l|| 3 ._c,||^^, <r)/card(l||^_c^.||^^^ <r). 

The algorithm is used with the £2 norm, but as pointed in 16^ it can be adapted to other distance measures,e.g. £ao 
which would make the comparison with the k-means more useful. ||60l adapts the k-windows algorithm to a 
’’naive” distributed implementation by merging all overlapping windows send by the local nodes at a central server, 
regardless of the number of examples in the overlapping region of the two windows and thus there is also no need for 
reiterating. This approach is oversimplified and in the case where the clusters are not clearly separated it was observed 
that it often leads to merging of neighboring clusters. While it greatly reduces communication costs, it is a direction 
that would need to be further researched. 

5 Central information server 

m applied asynchronous stochastic gradient descent to a non-convex problem of training a deep neural network and 
made use of the adaptive learning rate procedure in 03 to increase robustness. The data is partitioned across several 
nodes that create a local model replica, which communicates with a parameter server by pulling current parameters 
and pushing gradients (computed on a mini-batch). The use of a parameter server is further used in ifJTl to create a 
communication efficient algorithm when using sparse data in high-dimensional space and provide a converge analysis 
for convex objective functions. 

We propose the following algorithm in which the server in iteration t when a node would push a computed parameter 

6 the server would record this as 9t ^ 9 and would send to the node the parameter 9t-i from memory. Any machine 
learning algorithm F( ) chosen to run on each of the nodes would be effectively seen as running in isolation on the 
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local dataset, where the node k after receiving back the parameter from the server would update 0 "™ ^ F^^'> ( 5 /received^ 
and on completion would push this result to the central server. The superscript {k) means that local datasets Xk (and 
Yfe) are used by the algorithm. 

Let us assume for now that the K nodes would contact the server in a Round Robin fashion and that 6*0 (central server) 
is initialized to and this value would be used also as starting point on the first node contacting the server (node 
1). In this case, the second node contacting the server (node 2) would use the parameter 61 which was computed 
using the local dataset of node 1 as a starting point and would update it to reflect also its local dataset. The procedure 
would continue with every node building on top of the results for its predecessor, ending with an equivalent update 

of the form Ot ■<r- ^ ... F^"^^ (^ 0 )) • • • ^ If F(-) is a first order method based on a convex objective 

this is equivalent to a mini-batch gradient descent algorithm. In this case even if the final parameters stored in each 
of the nodes are not the same, if we guarantee convergences of the non-distributed algorithm, then they are all in a 
e-neighborhood of the optimal solution. If the datasets themselves are large, F{-) could also be stochastic in nature 
and the algorithm would converge as long as the non-distributed algorithm converges under a stochastic mini-batch 
algorithm. 

The extension to an asynchronous version is now straightforward if indeed the learning method is backed by a non- 
distributed algorithm for which a stochastic mini-batch algorithm exists. Whenever a nodes finishes it computations 
it contacts the server and ask for the new parameter 6 and in the unlikely case of two nodes contacting the server at 
exactly the same moment, ties can be broken arbitrarily. As in the synchronous version, every node builds on top of 

the results for its predecessor, but the equivalent update has the form Ot ^ ^ ... ^, where 

Si, S 2 ,..., S't are discrete RV from a distribution S with the support K}. By ensuring thatp(5' = i) > 0 for 

every 1 < i < K and S ^ S, which is equivalent to saying that there exists no node that will never contact the server, 
it can be proven that the algorithm indeed converges. The actual distribution S is dependent on the local datasets, e.g. 
number of examples which practically at least influence memory retrieval time. 

The analysis is not completed as we can imagine that online learning algorithms can be used and we only wished to 
detail this method which to the best of our knowledge is new and would resolve a stringent problem in distributed 
algorithms (not machine learning solely) efficient use of asynchronicity for parallel algorithms. It has the advantage 
that there is no wasted local computation, but also that it is easy to implement and also to provide convergence 
guarantees (at least if we start from non-distributed algorithm for which a stochastic mini-batch algorithm exists). 

6 Discussion 

Although a lot of progress has been made in the past decade on adapting machine learning algorithms to distributed 
settings, a lot of questions still remain open. For us the most stringent question is if we can easily import the results 
from the senor network, obtained with great effort during the hype of wireless sensor network, into a central node 
setting. Most of the distributed algorithms are obtained from trying to parallelize their successful non-distributed 
counterparts, but for most of the times we have just empirical adaptations, without proof of results equivalence and how 
the convergence rate is affected by this tradeoff. While communication costs for most supervised learning algorithms 
is reduced and also various methods of masking actual data points to ensure privacy exist, simply because most of the 
times we can get away with just transmitting the set of parameters or hyperparameters, for unsupervised clustering it 
still constitutes into a great challenge, especially in the case of heterogeneous distributed data. 

We have proposed future directions for both supervised and unsupervised learning, ranging from simple linear logistic 
regression to graphical models and clustering. The report focused on how security and low communication overhead 
can be assured in the specific case of a strictly client-server architectural model, specific to the potential personal 
healthcare applications, but at the same time there are other influential factors,e.g. performance and accuracy of a 
global optimizer versus parameters adapted to local data or a mixture of both. 

The empirical clustering algorithm, k-windows, is not as it is now a solution, many of the core components are 
not precisely defined or supported by analytical choices and influence of parameter selection is not obvious. From 
our experiments it does not perform well in high-dimensional spaces and does not perform well when the clusters 
are not clearly separable, especially in the distributed version where the aggregation procedures leads to merging of 
neighboring clusters. 

On the other hand we proposed an asynchronous distributed machine learning algorithm that would scale well and 
also would be computationally cheap and easy to implement. If the learning method is backed by a non-distributed 
algorithm for which a stochastic mini-batch algorithm with convergence guarantees exists, adapting it to our method 
is straightforward and we can also import all of the analytical results, including convergence rate which surprisingly 
would remain the same as in its non-distributed counterpart. This is an amazing result considering the fact most of the 
times there exists a tradeoff between parallelization and convergence rate. 
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